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Abstract 

Background: We studied patterns of molecular adaptation in the wild Mediterranean legume Medicogo truncotulo. 
We focused on two phenotypic traits that are not functionally linked: flowering time and perception of symbiotic 
microbes. Phenology is an important fitness component, especially for annual plants, and many instances of 
molecular adaptation have been reported for genes involved in flowering pathways. While perception of symbiotic 
microbes is also integral to adaptation in many plant species, very few reports of molecular adaptation exist for 
symbiotic genes. Here we used data from 57 individuals and 53 gene fragments to quantify the overall strength of 
both positive and purifying selection in M. truncatula and asked if footprints of positive selection can be detected 
at key genes of rhizobia recognition pathways. 

Results: We examined nucleotide variation among 57 accessions from natural populations in 53 gene fragments: 5 
genes involved in nitrogen-fixing bacteria recognition, 1 1 genes involved in flowering, and 37 genes used as 
control loci. We detected 1757 polymorphic sites yielding an average nucleotide diversity (pi) of 0.003 per site. 
Non-synonymous variation is under sizable purifying selection with 90% of amino-acid changing mutations being 
strongly selected against. Accessions were structured in two groups consistent with geographical origins. Each of 
these two groups harboured an excess of rare alleles, relative to expectations of a constant-sized population, 
suggesting recent population expansion. Using coalescent simulations and an approximate Bayesian computation 
framework we detected several instances of genes departing from selective neutrality within each group and 
showed that the polymorphism of two nodulation and four flowering genes has probably been shaped by recent 
positive selection. 

Conclusion: We quantify the intensity of purifying selection in the M. truncatula genome and show that putative 
footprints of natural selection can be detected at different time scales in both flowering and symbiotic pathways. 



Background 

Understanding the mechanisms underlying adaptation of 
species to their environment is central to population 
and evolutionary biology. Recent advances in large-scale 
sequencing and genotyping now allow rapid characteri- 
zation of molecular diversity in a set of local populations 
or across an entire species range but reliably identifying 
the molecular footprints of adaptation remains challen- 
ging [1,2]. Two analytical approaches can be used to 
identify sequences genes whose evolution has been dri- 
ven by environment-specific selection. One avenue is to 
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search for genomic regions exhibiting an excess of dif- 
ferentiation between populations, typically using F S t as 
a test statistic [3,4]. This approach is most likely to be 
successful when used in conjunction with a set of popu- 
lations spanning a clear environmental gradient [5-7] 
Barring a clear environmental gradient, one can still use 
allele(s) frequency information, the so-called site fre- 
quency spectrum (SFS), often summarized by a unique 
statistic such as Tajima's D to detect departure from 
selective neutrality [8]. These two methods are based on 
different aspects of polymorphism data, and thus are 
expected to be complementary to one another. More- 
over, combining the methods may allow for the detec- 
tion of different forms of selection (i.e. local adaptation 
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and uniform directional selection through a species 
range) or increasing power to detect more complex 
selection histories, for example a selective sweep within 
a single population that can drive an excess of differen- 
tiation merely by reducing the level of intra-population 
diversity. Although genome-wide study of molecular 
variation and scan for footprints of selection are now 
routinely reported in major genetical models, candidate 
gene-centred studies allow for testing of specific hypoth- 
esis regarding the evolutionary history of specific func- 
tional categories of genes [9]. 

For plants, timing of flowering is an essential compo- 
nent of fitness. Flowering can be influenced by a variety 
of climatic factors such as day length, average tempera- 
ture and drought stress. Accordingly numerous 
instances of molecular adaptation have been reported 
for flowering genes, particularly in the annual model 
plant Arabidopsis thaliana [10-13]. The ability of a 
plant to engage in successful biotic interactions is also 
integral to fitness. Although numerous work documents 
the strong imprint of natural selection on plant genes 
encoding perception of pathogens (so called R genes), 
both in A thaliana but also a suite of other plant species 
[14-16], much less is known on genes encoding the per- 
ception of beneficial (symbiotic) partners. The relative 
paucity of studies might be explained by the fact that 
the plant model A. thaliana does not engage in symbio- 
sis with any of the two major plant-microorganism sym- 
bioses: the ancient and widespread association with 
mycorrhizae and the more specialized symbiosis with 
rhizobia. 

The barrel medic Medicago truncatula is a model 
legume species used for dissecting the genetics of 
legume-rhizobia and plant-mycorhizae interactions. Its 
natural habitat spans the Mediterranean basin. It is an 
annual selfer found in ecologically unstable populations, 
resulting in large rates of genetic drift within and 
among populations [17,18]. For these reasons the effi- 
cacy of natural selection to drive adaptive changes is 
likely to be challenged by genetic drift and population 
structure. 

In this study we examined signatures of natural selec- 
tion in several genes of two functionally important and 
largely independent signalling pathways: the flowering 
and symbiotic pathways in M. truncatula. To do this, 
we surveyed patterns of naturally occurring diversity at 
the species level for a set of genes in each pathway and 
a set of control genes. We first quantified the strength 
of purifying selection in M. truncatula and the relative 
importance of positive selection in genes from both 
pathways and a set of independent control genes. We 
then examined broad patterns of genetic subdivision in 
our sample and split our sample in two ancestral 
groups. Accordingly, we model demographic factors that 



influence coalescent history and contemporary levels 
and patterns of diversity within each group [2], The 
rationale for doing so is that strong substructure can 
blur inference of non-neutral evolution [19]. Finally we 
searched for signatures of natural selection based on the 
estimated large-scale population structure, within both 
ancestral groups with the objective of tracking signa- 
tures of selective sweeps in candidate genes. 

Methods 

Plant material 

A panel of 57 Medicago truncatula genotypes (hereafter 
accessions) consisting of naturally occurring inbred lines 
sampled from throughout the species range was chosen 
to represent the species level and a regional level. At the 
species level, we used a core collection containing 32 
accessions, that was previously defined to represent the 
geographic distribution of the species and to maximize 
allelic variation at 13 SSR loci initially investigated on a 
sample of 347 accessions [20]. To investigate genetic 
diversity at a smaller spatial scale, 25 accessions belong- 
ing to a more homogeneous genetic group (as inferred 
through microsatellite genotyping) were included in the 
panel. Most of these accessions were initially collected 
in the Iberian Peninsula and Morocco. The list of acces- 
sions used is listed in Additional file 1, Table SI and 
their geographic origin depicted in Figure 1. Five other 
Medicago species were included as potential outgroups 
(one accession per species): M. tornata, M. rigidula, M. 
rigiduloides, M. arabica, and M. marina. Unless noted 
otherwise, all accessions were collected in natural popu- 
lations and self-fertilized two generations before being 
grown for DNA extraction. 

Amplification and sequencing 

We used data from 37 genie fragments that we PCR 
amplified and sequenced to model the demographic 
history of our sample (see below). Primers to amplify 
these fragments were chosen from two sources [21,22] 
and within the set of primers available, we selected pri- 
mers that amplify relatively long fragments (Additional 
file 2, Table S2). Fragments typically consisted of 
exon-bound primers spanning one or more introns. 
The fragment length ranged from 287 to 8897 with a 
mean of 1644 (Additional file 3, Table S3) and all PCR 
products were sequenced directly. Protocols for DNA 
extraction, PCR amplification, fragments purification 
and Sanger sequencing were identical to those 
described in De Mita et al. [23]. 

Nodulation gene fragments are located in 5 genes: 
DM11, DM13, NIN, NFP, PG3, involved in the Nod fac- 
tor perception pathway [24]. Primers used for PCR 
amplification and subsequent sequencing were identical 
to the ones described in De Mita et al. [23]. Flowering 
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Figure 1 Geographic origin of the accessions. Colour-coding refers to the assignation in two groups as inferred using Instruct for K = 2. 
Group 1 (red) comprises 25 accessions and is broadly distributed geographically. Group 2 (blue) consists of n = 32 accessions. 



gene fragments are located in the following genes: COL 
(Constans-Like), FCA2, FRLa, FTLa, FTLb, FTLc, Soclb, 
VRN, PHYa, PHYb, PHYe, which are involved in the sig- 
nalling pathway controlling day length perception, circa- 
dian clock and flowering inception [25]. Two gene 
fragments, FCA2 and Soclb, were amplified from cDNA. 
For these, total RNA was extracted from fresh leaves 
using TRI REAGENT™ (T9424, Sigma®) buffer and the 
Reverse Transcription System kit from Promega®. 
Sequences of primers used for PCR amplification and 
sequencing are available available in Additional file 2, 
Table S2. 

For all gene fragments, sequence reads base calling 
and annotation were performed manually using Pregap4 
and Gap4 programs of the Staden software version 1.6 
[26]. Annotation was retrieved either from reference 
gene sequences when available or from Gene Index 
Tentative Contigs [27] exhibiting an exact match. All 
sequences are deposited in Genbank (HQ735420 - 
HQ738281) and alignments are available from the 
authors in the Dryad archive (see Availability of sup- 
porting data). 

Analysis of patterns of polymorphism and divergence 

Analyses of polymorphism were performed using custo- 
mized scripts in Python and the SeqLib library version 
1.6 (S. De Mita and M. Siol, unpublished) incorporating 
Bio++ version 1.9 [28]. Sequences spanning less than 
25% of the length of the longest sequence of the align- 
ment (ignoring all undefined or missing data) were 
removed. This threshold was raised to 60% for the ana- 
lysis of synonymous/ non- synonymous variation. The M. 



tornata sequence was used as outgroup unless the M. 
tornata sequence was less than 75% of the length of the 
longest sequence. In these cases, the longest sequence 
from the other outgroup species was used (see Addi- 
tional file 3, Table S3). For polymorphism analysis con- 
cerning synonymous/non-synonymous variation, 
estimation of the recombination rate and haplotype ana- 
lysis, all sites with at least one undefined or missing 
nucleotide position were excluded. For all other statis- 
tics, sites with at least 70% defined data were used. 

Counts of the number of sites exhibiting polymorph- 
ism and divergence within each gene fragment were 
used to study the relative roles of positive and purifying 
selection in shaping polymorphism and divergence in 
the coding regions. We followed the approach imple- 
mented by Welch [29] and recently extended by Obbard 
et al, [9] to infer the fraction l-/of mutation under 
strong purifying selection and the proportion a of non- 
synonymous nucleotide divergence driven by positive 
selection. We used a series of nested models of increas- 
ing complexity ranging from the simplistic pure neutral- 
ity (f = 1, a = 0) to models incorporating a variable 
intensity of positive or purifying selection (/and a are 
allowed to vary according to each gene category). Esti- 
mation of parameters for each gene or category of gene 
was made using the software MK_test [29]. We used 
Akaike's Information Criterion (hereafter AIC) to com- 
pare models, as suggested by Welch [29]. When models 
differed by less than 5 AIC units, we used an averaging 
procedure to obtain robust estimates of a's and f s para- 
meters. Briefly, we computed a model-averaged estimate 
of as (and/s) as the weighted average of the individual 
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estimates obtained under each model and used the dif- 
ferences in AIC of each competing model for weighting 
[30]. 

Population structure analysis 

We used the Bayesian clustering algorithm implemented 
in InStruct [31] to analyse population structure. InStruct 
uses multilocus genotype to assign portions of individual 
genomes to a predetermined number (K) of clusters and 
can account for the likely presence of inbreeding within 
clusters (given the selfing rate of M. truncatula). We 
used all polymorphic sites detected on control loci. Each 
site was considered as a single locus and we used a hap- 
loid setting. We ran Instruct, using Mode 1, which 
infers population structure only (no inference on the 
selfing rate) and assumes that admixture is possible 
between groups. For each run, we used a burn-in period 
of 5 x 10 5 iterations and then 10 6 iterations for estimat- 
ing the parameters. Five independent chains were runs 
for each of 5 K values, K = 1 to K = 5. For each run, 
individual accessions were assigned into a group accord- 
ing to their highest proportion of membership into this 
group. The clusters thus defined were compared among 
runs for each K value and the best fitting K was chosen 
using the AK method of Falush et al. [32]. The fraction 
of the observed genetic variation explained by the 
inferred clusters was measured by the Fst parameter as 
implemented for nucleotide sequence data [33]. The 
stratification obtained for K = 2 appeared as the most 
robust one (see results) and was used in neutrality tests. 

Frequency spectrum neutrality test 

We used patterns of polymorphism within each group 
to detect instances of selection causing departure from 
the neutral frequency spectrum. To do so we devised a 
test combining two neutrality test statistics relying on 
the underlying frequency spectrum: Tajima's D [8] and 
the standardized version of Fay and Wu's H [34,35] 
which we call Z. In order to obtain the joint distribution 
of these statistics under the hypothesis of selective neu- 
trality while accounting for demographic history in each 
group, we followed the strategy used in De Mita et al. 
[23] and fitted population expansion models in each of 
the two groups. The procedure was repeated indepen- 
dently for each group and a different scaled mutation 
rate and rate of expansion were assumed for each 
group. We used an Approximate Bayesian framework 
[36] as implemented in the Python SeqLib library to 
obtain the joint posterior distribution for parameters. 

We used the frequency of four minor allele frequency 
categories (0.125, 0.25, 0.375 and 0.5) as summary statis- 
tics of the data in each group. Then we approximated 
the joint posterior with 10 7 coalescent simulations and 
using local-linear regression on tan-transformed data 



with a tolerance factor of 10" 3 . The posterior joint distri- 
bution (scaled mutation rate and exponential population 
growth rate) was binned using 5184 evenly distributed 
bins over both dimensions. This posterior distribution 
was then used to obtain - through 10 6 independent 
simulations (by sampling a mutation and expansion rate 
to parametrize each coalescent simulation) - the distri- 
bution of D and Z expected at each locus. Note that 
locus specific sample size and alignment length were 
used in these simulations. A probability of misorienta- 
tion of sites of 0.00936 was estimated from control loci 
[37] and incorporated in the simulations used to obtain 
the null distribution of (D, Z) at each locus. This was 
done by randomly switching the orientation (ancestral 
versus derived) of alleles after each coalescent simula- 
tion stage. This additional twist is an effective way to 
guard false positives, i.e. low p-values at locus evolving 
neutrally but where SNP misorientation is inflating Z. 
The (D, Z) distribution was discretized over 576 cate- 
gories and the p-value of the bin containing the 
observed (£>, Z) value of each locus in our data was 
computed. 

Evolutionary analysis of nucleotide substitution patterns 

Patterns of between-species substitutions of DMI1, a 
gene previously identified as candidate for adaptation 
[23], were also analysed. To do this we sequenced most 
of the protein-coding sequence of DMI1 (same default 
protocol) in nine additional Medicago species (M. 
ciliaris [L0897E], M. laciniata [L0904E], M. noeana 
[L0908E], M. polymorpha [L0911E], M. orbicularis 
[L0913E], M. sauvagei [L0928D], M. carstiensis 
[LCarsX], M. coerulea [LCoerX], M. ruthenica [LRuth]) 
as well as previously cited M. tornata, M. ciliaris, M. 
rigidula, M. rigiduloides, M. marina and the L0736D 
accession from M. truncatula. In addition, we included 
the reference sequence M. truncatula Jemalong A17 
from the literature (DMI1 mRNA sequence available as 
Genbank accession number AY497771). 

Sequences were aligned at the protein level using 
MUSCLE version 3.8.31 with default options [38] with 
minor manual corrections to eliminate spurious amino 
acid replacements generated by the alignment proce- 
dure. For the same reason, the region spanning position 
316 to 382 in the coding DNA sequence alignment 
(positions 280-303 in AY497771) consisting of gaps and 
poorly aligned sequences were removed. A molecular 
phylogeny was built with PhyML version 3.0 [39] using 
coding and protein sequences. We used the most con- 
servative option of using the protein-based tree for 
further analysis as it minimizes the number of assumed 
amino acid changes. Using this phylogeny, we applied 
tests of positive selection based on codon-based substi- 
tution models [40] implemented in PAML version 4.2 
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[41]. The tests M2a versus Mia and M8 versus M8a 
yielded identical results and we report the first compari- 
son only. We applied the test as recommended in 
PAML documentation and identified positively selected 
sites using a Bayes Empirical Bayes procedure. 

Results 

Levels of nucleotide variation in Medicago truncatula 

We surveyed nucleotide diversity within M. truncatula 
by re-sequencing 53 gene fragments (37 control loci, 11 
flowering-related and 5 nodulation-related). A total of 
100,178 nucleotides were re-sequenced of which 74,233 
were included in statistical analyses (after excluding sites 
with excessive missing data or alignment gaps). Gene 
fragments were sequenced in 57 M. truncatula acces- 
sions representing the whole-species diversity and in 
one outgroup. Overall M. truncatula is characterized by 
a fairly low level of nucleotide diversity at the species- 
wide level as measured through the scaled mutation rate 
(mean Watterson's 0 = 0.0044 per base pair (bp 1 ) and 
mean nucleotide diversity is tt = 0.0034, see Table 1), 
but in the range of studies surveying species with similar 
mating system: species-wide samples in Arabidopsis 
thaliana (9w = 0.008 and n = 0.006 [42]), in wild rice 
(0.0037 in Oryiza rufipogon [43]), and the self fertilizing 
worm C. elegans (tt = 0.003 for silent polymorphism 
[44]) or within autogamous populations of Eichornia 
paniculata (tt = 0.0008 [45]). Levels of nucleotide varia- 
tion for control and candidate genes fragments are vir- 
tually identical (Figure 2). One notable exception is the 
nodulation gene DMI3 that exhibits very high levels of 
polymorphism (Table 1 and Additional file 4, Table S4). 



Table 1 Summary of nucleotide polymorphism in control 
gene fragments 



Average Statistic 


Whole sample 


Group 1 


Group 2 


Number of sequences used 


50 


22 


28 


Number of sites analysed 


1208.27 


1184.97 


1217.70 


Number of polymorphic sites 


32.46 


19.27 


25.03 


Watterson's 9 & 


0.004421 


0.003231 


0.003662 


Nucleotide diversity (n) & 


0.003409 


0.002766 


0.003185 


Taj i ma's D 


-0.81 


-0.73 


-0.58 


Fu and Wu's 


0.04 


0.03 


-0.05 


F S t between group 1 and 2 


0.156 


N/A 


N/A 


Hudson's p 


0.063 


0.020 


0.164 



All values are given as averages over the 37 control loci. 

Group 1 and Group 2 are the two groups inferred using Instruct on sequence 

data obtained on control loci: Group 2 is the Spain-Morocco group and 

Groupl the eastern part of the sample (Figure 1) 

&: per site estimates are reported here, 

#: Z standardization of Fay and Wu's H. 

N/A: not applicable. 

The value Hudson's p reported here is the average of p estimates per 
analysed site. 



Beyond average levels of polymorphism, we document 
a rather large variance in the scaled mutation rates as 6 
estimates span two orders of magnitude (from less than 
0.0005 bp" 1 at LEG722, where no segregating sites were 
observed over 500 bp, to 0.02051 bp" 1 at LAX4 and 272 
observed segregating sites over 8900 kb). 

Overall, values for Tajimas D statistic in the whole sam- 
ple vary widely but are generally skewed towards negative 
values (minimum -2.60, maximum 1.05, average -0.79, 
median -0.89). Candidate gene fragments and control frag- 
ment distributions are largely overlapping (Figure 2). 

The strength of purifying selection and adaptive 
evolution in Medicago truncatula 

Next, we estimated the fraction l-/of amino acid chan- 
ging mutation undergoing strong purifying selection and 
the proportion a of non-synonymous divergence 
between M truncatula and our outgroup (M. tornata) 
that can be attributed to positive selection. We used 5 
competing models - with varying amounts of purifying 
and positive selection - and their relative fit is summar- 
ized in Table 2. Model M0, which posits strict neutral- 
ity, fits the data poorly and was not considered further. 
Model comparison using differences in AIC provides 
suggestive evidence that, despite low level of polymorph- 
ism compatible with a rather small long-term effective 
size, a substantial amount of purifying selection con- 
strains amino acid changing polymorphism. On average, 
about 90% of amino acid-changing mutations are 
removed by strong purifying selection. This is as strong 
or stronger than what has been reported for A. thaliana 
where 66% of amino-acid changing mutation where 
strongly deleterious [46]. 

The estimated level of adaptive evolution is high (a - 
0.25) but there is a very large sampling variance and this 
estimate is not statistically significant from zero (likeli- 
hood ratio test comparing M2 versus M4 = 3.2, p = 0.07). 
Last, we asked if a model (M5) positing that different 
types of gene fragments experienced different level of 
adaptive substitution improved the fit of the data. Com- 
parison with a model with homogeneous level of adaptive 
substitutions (M4) suggests that flowering genes may 
experience more adaptive substitution (a = 0.37) and 
that symbiotic genes experience fewer substitution (a = 
-0.04) but again sampling variance around the estimates 
of a for each category of genes is very large and the 
model M5 overall fits poorly the data relative to a more 
parsimonious one (M4). We thus choose to report more 
robust model averaged estimates of /and a in Table 2. 

Broad patterns of genetic subdivision in Medicago 
truncatula 

We studied broad patterns of genetic subdivision in our 
sample and inferred stratification in K groups within 
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Figure 2 Box plots summarizing patterns of nucleotide variation in Medicago truncatula. Boxplots (shaded gray) depict the empirical 
distribution obtained for control fragments. Dots represent individual flowering candidate genes (orange) and symbiotic genes (green). A: 
Distribution of the scaled mutation rate (as estimated with Watte rson's 9) per bp for each fragment. B: Pairwise nucleotide diversity (n). C: 
Tajima's D statistic for each fragment. Z: standardized Fay and Wu's statistic. 



our species-wide sample. We used values of K ranging 
between 1 and 5. Consistent estimates of log-likelihood 
and stratification patterns were obtained for all K values. 
Posterior probabilities of the data always increased with 
increasing subdivision {i.e. higher K); the highest gain 
occurring between K = 1 and K = 2 and between K = 2 
and K = 3 (Additional file 5, Figure SI). 



At K = 2, the inferred structure was stable across runs, 
geographically consistent, and in agreement with a pre- 
vious structure inferred on the basis of SSR variation in 
a set of 13 microsatellite loci on a much larger sample 
(n = 347) of accessions [20]. This stratification splits our 
species-wide sample into a group of 32 accessions origi- 
nating from the South-West of the Mediterranean basin 
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Table 2 Intensity of purifying selection (1-fl and levels of adaptive evolution (a) 



Model name and description 


f 




a 




Aaic 


W A | C 


MO: Strict selective neutrality 


1 




0 




321 


~ 0 


M1: Homogeneous purifying selection + 
No adaptive evolution 


0.13 




0 




2.16 


0.06 


M2; Purifying selection with variable intensity + 
No adaptive evolution 


f — 
M - 

f 2 = 

fs = 


n 1 1 

U. I I 

0.17 
0.08 


u 




1 1 a 


n 1 (\ 

U. I D 


M3 Homogeneous purifying selection + homogeneous levels of adaptive evolution 


0.1 1 




0.36 




0.79 


0.24 


M4 Purifying selection with variable intensity + homogeneous levels of adaptive evolution 


u = 
f 2 = 
h = 


0.10 
0.14 
0.07 


0.35 




0 


0.52 


M5: Purifying selection with variable intensity + varying levels of adaptive evolution 


fy = 
h = 
h = 


0.09 
0.14 
0.09 


a! = 
a 2 = 
a 3 = 


0.4 

0.37 

-0.04 


3.48 


0.02 


Model average (over MO to M5) 


u = 
h = 
h = 


0.08 
0.11 
0.06 


OCi = 

a 2 = 
a 3 = 


0.28 

0.028 

0.027 


N/A 


N/A 



1-f quantifies the intensity of purifying selection through the fraction of new amino acid changing mutation under strong purifying selection, a is the fraction of 
divergence attributable to adaptive evolution (driven by positive selection on amino acid changing mutations). 

The best model (M4) is highlighted in bold and all differences in AIC among models A A , C ) are computed relative to M4. W A , C are the statistical weights of each 
model in the model averaging procedure we used. 

Subscripts in f lf f 2 , f^, oii, a 2 , a 3 refer to the gene fragment categories. 1: control fragments, 2: flowering gene fragments, 3: symbiotic gene fragments. N/A: not 
applicable. 



and a group of 25 accessions originating from a broad 
Eastern part of the Mediterranean basin (Figure 1). 
Except for three accessions (L00213, L00368 and 
L00401), all the accessions in this study were assigned 
to the same group as when using SSR data. This stratifi- 
cation accounted for 17% of the total genetic variability 
in our data. At K = 3, independent runs were also con- 
sistent, but the stratification obtained was not geogra- 
phically consistent, splitting the two groups defined at K 
= 2 with no clear pattern (Additional file 6, Figure S2). 
This stratification also explained less genetic variability 
than K = 2. For K = 4, the stratification was the same as 
that obtained with K = 3 except for one individual 
(L00213) which constituted alone the fourth group. At 
K - 5, most accessions appeared admixed, i.e. showing 
proportions of membership into each of the 5 clusters 
lower than 0.3 or 0.4. 

Given the geographic consistency of the subdivision 
obtained at K = 2 and the congruence of this subdivision 
with that obtained using SSR loci, we choose to use K - 2 
as the subdivision pattern for further analyses. Note that 
this analysis is by no means intended to infer the actual 
number of populations underlying our sample but merely 
a way to group accessions in fairly homogeneous groups. 
The two groups show relatively similar levels and pat- 
terns of nucleotide variation (Table 1) despite the lower 
geographic area covered by the Western group (referred 
as group 2). Both groups also exhibit negative Tajimas D 
values, suggesting recent events of population expansion 
occurred in both groups (see Additional file 7, Table S5 
for details on diversity statistics per fragment). 



The frequency spectrum at several flowering genes 
departs from neutrality and suggests recent or ongoing 
selection 

To determine if patterns of polymorphism departed 
from selective neutrality we combined two statistics, D 
and Z, summarizing the frequency spectrum. This level 
of analysis strikes a compromise between retaining high 
enough polymorphism while using a sample for which 
we can model patterns of polymorphism (in order to 
obtain null distributions for neutrality test statistics). 
Our approach builds on previous work showing that a 
scenario assuming a single population with exponential 
growth, although simplistic, can adequately capture pat- 
terns of polymorphism in control gene fragments [23]. 

We mapped the observed values (D, Z) of each gene frag- 
ment on the joint density of (D, Z) expected under selective 
neutrality while accounting for the demographic history of 
each group (see methods for details). A graphical summary 
of our results is presented in Figure 3 (p-values are in Addi- 
tional file 8, Table S6). P-values of the joint test (D, Z) 
obtained in the two groups exhibit virtually no correlation 
(observed correlation of 0.12 across n = 46 loci, p > 0.4). 
Empirical distribution of p-values at control loci was also 
slightly L-shaped (not shown), suggesting that despite our 
efforts to account for the coalescent history of our sample, 
our test is still probably slightly too liberal so that p-values 
should be interpreted with caution. Among the 16 surveyed 
candidate genes, observed patterns of nucleotide diversity 
(as summarized by values of D and Z) fall outside the simu- 
lated neutral distribution for 7 fragments: FRLa and DMI3 
exhibit Z values in both groups that are larger than 
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Figure 3 Tests of selective neutrality of polymorphism within each group. The joint distributions of (D, Z) tests statistics expected under 
neutrality in the Eastern group (group 1, panel 3a) and the Western group (group 2, panel 3b) are plotted using a blue shading for the 
probability density. Symbiotic genes are plotted as green dots and flowering genes as orange dots using each candidate gene abbreviation. 
Control loci are represented as black dots. The joint distribution for (D, Z) within each group was obtained through 10 5 coalescent simulations 
from models parametrized to fit patterns of polymorphisms in the set of controlled fragments (See Methods for further details). Note that in 
order to generate a unique graphical representation of the neutral joint distribution, simulations for each of the 53 loci were pooled and the 
resulting (D, Z) distribution was binned using 10 4 categories (rigorous p-values computed using null distributions tailored for each locus length 
and polymorphism are available in Additional file 7, Table S5). 



expected under neutrality, FTLb and PHYa exhibit too 
negative Z values in the Western group 2, VRN and PHYe 
exhibit too negative D values in the (Western) group 2, and 
AfflV exhibits a negative Z value in the Eastern group (group 
1). Several loci - including control loci - exhibited low p- 
values and inspection of the (A Z) coordinates reveals loci 
with abnormally high values of Z but typical values of D. 
Negative values of Z sign an excess of high-frequency 
derived mutations, which can be interpreted as hitch-hiking 
episodes driven by recent selection. In contrast, occurrence 
of positive Z values (deficit of high-frequency derived muta- 
tions) cannot be explained by natural selection in any 
straightforward manner. Notably, genes displaying high 
values of Z often exhibit the highest diversity indices (some- 
times by orders of magnitude compared to other genes). Z, 
unlike D, might be sensitive to variations in mutation rate. 
Even worse, Z is very sensitive to variation of gene geneal- 
ogy structure, as deep genealogies might pre-date the diver- 
gence with outgroups and drive spuriously high values of Z. 
As a result, we chose not to report as departing from neu- 
trality loci exhibiting a very high value of Z and a high 
diversity, as for these an artefact due to high mutation rate 
seems likely (see below how sensitivity of Z in the [D, Z] 
test could be addressed). 

Barring such cases, this method detected footprints of 
selection in four genes potentially involved in flowering 
date (FTLb, PHYa, VRN and PHYe) and one symbiotic 
gene (AfflV). The gene FTLb displayed in the Western 
group a very negative Z value compared to neutral expec- 
tations, suggesting positive selection on this gene while 
PHYa exhibited a negative Z value but also a strikingly 
large D value in this group, as expected under diversifying 
selection or local adaptation. Both VRN and PHYe, dis- 
played very negative D values as expected under 



directional selection. For these two genes, however, the 
deviations from expected values were less important 
resulting in larger p-values (p = 0.025 and 0.023 respec- 
tively compared to 0.0066 for FTLb and 0.0006 for PHYa). 
NIN exhibited a negative Z value as expected under posi- 
tive selection in the Eastern group only (p = 0.019). 

Evidence for selection in DM11 

DMI1 was the only nodulation candidate detected in a 
previous analysis [23] but is not singled out in the cur- 
rent application of the (D-Z) test. However several SNPs 
located in exon 6 of DMI1 exhibited an excess of differ- 
entiation. The analysis of local variation of patterns of 
polymorphism shows interesting features (Figure 4), 
although challenging to interpret. The strikingly con- 
trasted pattern of differentiation and Tajima's D might 
explain why this gene could not be detected using gene- 
wise summaries of polymorphism and suggest that 
selection occurred in the gene (see the Discussion sec- 
tion for possible interpretations). Codon evolution ana- 
lysis strongly supports the hypothesis that several sites 
in DMI1 evolved under positive selection in the history 
of the Medicago genus. A model allowing for positive 
selection on a fraction of the sites (M2a) fitted much 
better the data than a model (Mia) with no sites under 
positive selection (likelihood ratio test, p = 0.016). The 
location of the sites that are very likely to have evolved 
under positive selection is represented in Figure 4. 

Discussion 

M. truncatula nucleotide polymorphism is shaped 
primarily by purifying selection 

The species-wide levels of nucleotide polymorphism 
we report here are in the lower range of plant studies 
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Figure 4 Sliding window analysis of the gene DM11. Polymorphism was analysed in 500 nucleotide-long windows with 50 nucleotide steps 

along the alignments. Displayed statistics are: S, the number of polymorphic sites per window, non-synonymous polymorphism given by 

Watterson's 9 per non-synonymous site, F ST given by Weir and Cockerham's estimator [59] and Tajima's D computed in both groups. The grey 

frames denote the positions of exons and the arrowheads mark the position of the sites found with high posterior probability (> 0.95) to be 

targets of positive selection in the Medicogo genus. 
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which is consistent with the expectation that selfing 
plant species harbour less nucleotide diversity than 
outcrossing species [47]. Note however that although 
mating system was the variable explaining most of the 
variation in polymorphism across plant species, the 
portion of the total variance in nucleotide diversity 
accounted by mating system alone is rather modest 
(less than 10% after phylogenetic relatedness is 
accounted for). Levels of polymorphism in M. trunca- 
tula thus lie in the low range although these are by no 
mean as extreme as the striking instance of some self- 
fertilizing species - Capsella rubella that recently 
evolved from the outcrossing Capsella grandiflora - 
which only displays one to two haplotypes throughout 
its species range [48,49]. We also report fairly strong 
purifying selection removing up to 85% of all non- 
synonymous (amino acid changing) mutations and 
marginal evidence for heterogeneity in the amount of 
purifying selection and positive selection among gene 
categories. Although the evidence for positive selection 



on flowering genes obtained by contrasting polymorph- 
ism and divergence (Table 2) is suggestive and also 
consistent with evidence obtained on a different time 
scale using the (D, Z) test, estimates of a should be 
viewed as preliminary for two reasons. First, a esti- 
mates are inherently noisy [9] and data on many more 
genes for each category would be needed to reduce 
sampling variance around estimates. Second demogra- 
phy is expected to affect the estimation of both / and 
a [29]. 

Model-based tests of selection 

Instead of relying on non-parametric methods for 
detecting outliers suggestive of selection, we employed 
model-based methods. These rely on first fitting pat- 
terns of nucleotide variation at a set of control loci as a 
means of parametrizing a model positing selective neu- 
trality and the most salient features of the demographic 
history that may be responsible for the patterns of 
nucleotide diversity in our sample. We discussed below 
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insights provided by these approaches but also their 
limits. 

Some loci, including controls, displayed large values for 
Fay and Wu's standardized H statistic (Z) that were 
strongly unlikely under our model. These genes also 
ranked among those with the highest per-site 
Wattersons0. While D is virtually insensitive to levels of 
polymorphism, Z and even more the non-standardized H 
statistic are dramatically so. Allowing for variation of the 
mutation rate among genes would make the (D, Z) test 
more robust. One option is to constrain 9 to be propor- 
tional to an independent estimate of mutation rates or a 
proxy such as interspecific synonymous nucleotide diver- 
gence. The outgroups we used were however too close - 
in terms of absolute amount of divergence to our sample 
- for this purpose, yielding putatively very noisy estimates 
for mutation rates that would exacerbate rather than fix 
the problem. At the cost of extra parameters, the varia- 
tion of 0 among genes could be treated as a hidden statis- 
tical parameter (such as the variance of a statistical 
distribution) to be estimated. 

Our model for patterns of polymorphism within group, 
by ignoring gene flow between geographic regions com- 
prising each group of accessions, underestimates the var- 
iance in coalescence times underlying amounts of 
polymorphism. Thus loci with neutral polymorphism but 
high 6 and in turn possibly high Z values were picked as 
departing from neutrality. A more sophisticated model 
would treat both populations jointly but addressing this 
would require a flexible yet computationally tractable 
population structure component for coalescence models. 
That remains a theoretical challenge. 

We observed a broad population structure (K = 2) con- 
sistent with previous analysis of SSR data [20] which, in 
the absence of obvious barrier to gene flow, seems likely 
a relic of past isolation reflecting the existence of glacial 
refuges both West (Spain) and East (Levant) of the Medi- 
terranean basin. Interestingly clear footprints of selection 
were detected only in one of the two groups (the western 
group 2) suggesting that selective pressures differ across 
Western and Eastern parts of the species range. No 
obvious environmental differences between these two 
geographic regions could be identified: both regions span 
a similar latitudinal gradient and share the mosaic of cli- 
mate, geology and anthropic pressures that characterize 
the Mediterranean [50]. The only notable difference 
between these two groups is the size of the geographic 
area they span, the Western group being restricted to the 
Iberian Peninsula and Northern Morocco. 

Molecular adaptation in the nodulation and flowering 
pathway 

Although the majority of Arabidopsis flowering genes 
have orthologs in legumes, little is known about the 



genes underlying variation in flowering time in Medi- 
cago truncatula. Previous studies have shown that as 
most Mediterranean legume species, Medicago trunca- 
tula exhibits reduced time to flowering under vernaliza- 
tion and under long day conditions [51]. We studied 
orthologs of genes that have been shown to be involved 
at different levels of the flowering signalling pathway in 
Arabidopsis thaliana: PHYa, PHYb, PHYe, VRN and 
FCA are involved in the perception of environmental sti- 
muli, COL (Constans-Like) and FRLa (Frigida-Like) are 
central genes and FTLa, FTLb, FTLc and SOClb are 
integrating genes [52]. We detected several signatures of 
selection in genes involved in the perception of environ- 
mental stimuli {PHYa, PHYe and VRN), Phytochromes 
are a family of plant photoreceptors that mediate phy- 
siological and developmental responses to changes in 
light conditions. In Arabidopsis thaliana, five photo- 
chrome (PHYa-e) genes have been shown to be involved 
in the timing of flowerng, with PHYa, PHYd, and PHYe 
playing a major role in the control of flowering under 
cool conditions [53]. VRN1 and VRN2 have been shown 
to mediate response to vernalization both in Arabidopsis 
thaliana and cereal crops [54]. 

We note that most footprints of selection detected in 
this study involve the most upstream genes of the flow- 
ering pathway. Such genes are expected to be associated 
with lower pleiotropic effect and are thus expected to be 
preferential targets of natural selection [55]. We also 
detected a significant deviation from neutral expecta- 
tions on the integrator gene FT (FLOWERING TIME). 
This gene acts on the floral meristem identity genes to 
initiate flowering and is one of six candidate genes that 
have been shown to be located in the support interval 
of a QTL that explained 10 to 60% of the flowering date 
variation in three RILs populations [56]. 

Further phenotypic and climatic data are needed to 
evaluate the role of these different genes in natural var- 
iation for flowering date in Medicago truncatula and to 
understand their involvement in the adaptation of the 
species to climatic variations. 

For nodulation genes, we detected footprints of selec- 
tion in two genes DMI1 and NIN Both genes are acting 
at the signal transduction level, without direct interac- 
tion with the symbiotic partner. DMI1 appears to have 
undergone selection both recently - between M. trunca- 
tula populations - and more anciently in the Medicago 
genus. A sliding window analysis reinforces this state- 
ment (Figure 4) while similar analyses for other candi- 
date genes did not uncover obvious local signatures of 
selection along the gene (data not shown). 

Conclusion 

We used Sanger sequencing to survey nucleotide varia- 
tion in Medicago truncatula. We focused on a set of 
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important genes with known function in the Nod factor 
perception and flowering pathways and asked if selec- 
tion had affected these genes at different evolutionary 
time-scales. We did so by contrasting patterns of poly- 
morphism at candidate genes with a set of control gene 
fragments and a coalescent framework where simple 
demographic models were fitted to the control data. 
The fitted demographic models were then used as a 
basis to obtain - via coalescent simulations - null distri- 
butions for statistics characterizing patterns of variation 
in genes of interest. Here we uncover several footprints 
of selection in both pathways and at different evolution- 
ary time scales. Our detailed analysis of diversity within 
groups demonstrates that model-based methods of ana- 
lysis can be quite sensitive to assumptions of the under- 
lying models used. Future models should explore more 
detailed demographic scenarios, possibly in a more spa- 
tially explicit context (see for instance [57] for analysis 
of spatial patterns of polymorphism in A. thaliana). 
Inspection of the empirical distribution of scaled muta- 
tion rates also suggests ample variation in per nucleotide 
mutation rate throughout the genome. Even if neutrality 
test statistics are built in order to minimize sensitivity to 
underlying variation in mutation rates, incorporating 
genomic heterogeneities in mutation and recombination 
along the genome may be needed to generate sound 
null distributions that integrate over both historical and 
mutational/recombinational unobserved processes that 
shape genomic diversity [58]. Fitting such models 
requires orders of magnitude of more polymorphism 
data but these will be available in the near future for the 
M truncatula species complex [http://medicagohapmap. 
org/] and increasingly so for non model species. Thus, 
barring the computational challenges to "scale up" to 
genome levels, model based approaches incorporating 
the additional touch of biological realism described 
above will be a fruitful framework for understanding 
evolutionary forces shaping genome variation and, ulti- 
mately, symbiotic evolution and flowering time in this 
system. 
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Additional file 3: Table S3 - List of sequenced loci. Microsoft Excel 
spreadsheet containing the list of sequenced loci, genomic location and 
results of sequencing, la is the alignment length and Is is the length of 
the sequenced obtained from A17 with alignment gap stripped. 

Additional file 4: Table S4 - Standard diversity indices for each 
fragment. Microsoft Excel spreadsheet containing standard diversity 
indices computed for each fragment. The results of different summary 
statistics are presented in separate sub-tables for control fragments, 
flowering loci and nodulation loci. NC: not computed. 

Additional file 5: Figure SI - Likelihood profile for K, the number of 
groups in the sample. Microsoft Word document showing the 
likelihood profile for K in the structure analysis. Likelihood was obtained 
using the software Instruct on the nucleotide polymorphism data from 
control loci. 

Additional file 6: Figure S2 - Group membership of the accession in 
structure analysis. Microsoft Powerpoint presentation containing the 
schematic representation of the assignment of the accessions in each 
group following Instruct analysis for K = 2, K = 3 and K = 4. 

Additional file 7: Table S5 - Standard diversity indices for each 
fragments within both ancestry groups. Microsoft Excel spreadsheet 
containing the values of standard diversity indices estimated for each 
fragment within both ancestry groups. NC: not computed. 

Additional file 8: Table S6 - P-values of the joint (D, Z) test. Microsoft 
Excel spreadsheet containing, for each locus, the value of D, Z, and the 
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